Training robust T1-weighted magnetic resonance imaging liver segmentation models using ensembles of datasets with different contrast protocols and liver disease etiologies

Image segmentation of the liver is an important step in treatment planning for liver cancer. However, manual segmentation at a large scale is not practical, leading to increasing reliance on deep learning models to automatically segment the liver. This manuscript develops a generalizable deep learning model to segment the liver on T1-weighted MR images. In particular, three distinct deep learning architectures (nnUNet, PocketNet, Swin UNETR) were considered using data gathered from six geographically different institutions. A total of 819 T1-weighted MR images were gathered from both public and internal sources. Our experiments compared each architecture’s testing performance when trained both intra-institutionally and inter-institutionally. Models trained using nnUNet and its PocketNet variant achieved mean Dice-Sorensen similarity coefficients > 0.9 on both intra- and inter-institutional test set data. The performance of these models suggests that nnUNet and PocketNet liver segmentation models trained on a large and diverse collection of T1-weighted MR images would on average achieve good intra-institutional segmentation performance.


Data curation and description
The inclusion criteria for MR images into our dataset are as follows:
1. The entire liver must be visible in the image.
2. All eight liver segments must be present. For example, there is no history of hepatectomy or lobectomy before image acquisition.
3. The image quality must be high enough such that the boundary of the liver is identifiable without using a pre-existing contour.
We manually inspected each image to determine if it met the selection criteria. This process also included using relevant patient and dataset metadata. The primary indicators to identify the liver segments include the presence or absence of the left and right portal veins and tissue homogeneity. We excluded images from patients who had undergone hepatectomy or lobectomy. This process resulted in a total of 819 T1-weighted MRIs from 312 patients. Of these, 72 were patients with cirrhosis (a risk factor and common finding in patients with primary hepatocellular carcinoma) whose MRIs were obtained from the Duke Cancer Institute (data collected from the Duke Liver Dataset [DLDS]) 19. Another 34 patients with liver cancer were obtained from The University of Texas MD Anderson Cancer Center. An additional 71 patients with hepatocellular carcinoma were collected from Houston Methodist Hospital. Fifty-eight anonymized patients with hepatocellular carcinoma from the A Tumor and Liver Automatic Segmentation (ATLAS) dataset were obtained from Bourgogne University in Dijon 22. Another 57 patients with "abdominal tumors/abnormalities" were obtained from the Longgang District People's Hospital in China with a protocol approved by the hospital's Research Ethics Committee (data collected from the Abdominal Multi-Organ Segmentation [AMOS] dataset) 28. Although a small subset of these patients' scans showed tumor growth and lesions on the liver itself, most patients had unrelated abnormalities. Finally, 20 healthy individuals were collected from the Dokuz Eylul University Hospital's Department of Radiology in Izmir, Turkey, using an Institutional Review Board-approved protocol (data collected from the Combined Healthy Abdominal Organ Segmentation [CHAOS] dataset) 18.
Ranges of repetition times, echo times, and contrast agents' protocol of the public datasets used are provided (whenever available in their corresponding paper) in Table 1. Figure 1 and Table 2 further summarize the datasets we used.

Swin UNETR
The Swin UNETR model is a deep learning architecture designed for medical image segmentation tasks, integrating the Swin Transformer with the UNETR framework 29,30. It leverages the Swin Transformer's hierarchical feature representation and shift windowing mechanisms to capture global context and local details within medical images effectively. The model's architecture combines the strengths of vision transformers in encoding long-range dependencies and the U-Net's efficient up-sampling and localization capabilities, resulting in improved medical imaging segmentation.

nnUNet
Since its introduction, nnUNet has become a popular tool for use in medical image segmentation because its ability to automatically configure a preprocessing and deep learning training pipeline based on the properties of its training data eases the burden of manually developing models to suit a particular data modality 16. We chose specifically to train 3D full-resolution U-Nets using nnUNet as a baseline for comparison against the other two models.

PocketNet
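The PocketNet paradigm was originally proposed to reduce the number of parameters in CNN architectures while maintaining their accuracy 31. This approach uses the similarity between geometric multigrid methods for solving linear systems arising from discretizing partial differential equations and CNNs to justify keeping the number of features at each resolution constant. In contrast, traditional CNNs double the number of features when going from higher to lower resolutions. As a result, PocketNet architectures reduce the number of parameters in CNN architectures by several orders of magnitude and have been shown to achieve similar accuracy to traditional CNNs. Here, we apply the PocketNet paradigm to the nnUNet architecture and refer to this architecture as PocketNet for the sake of conciseness.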
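To make the effect of the PocketNet idea concrete, the following is a minimal sketch that counts convolution parameters in a toy encoder whose feature count either doubles at each lower resolution or stays constant. The depth (5 levels), initial feature count (32), kernel size (3 × 3 × 3), and two convolutions per level are assumptions chosen for illustration; they are not the configurations used in this work, and the resulting counts are not the parameter counts of the actual models.

```python
def conv_params(c_in, c_out, kernel=3, dims=3):
    """Weights plus biases of one convolution layer."""
    return (kernel ** dims) * c_in * c_out + c_out


def encoder_params(widths, in_channels=1, convs_per_level=2):
    """Total parameters of a plain convolutional encoder with the given widths."""
    total, c_in = 0, in_channels
    for width in widths:
        for _ in range(convs_per_level):
            total += conv_params(c_in, width)
            c_in = width
    return total


levels = 5   # assumed number of resolution levels
base = 32    # assumed feature count at the highest resolution

doubling = [base * 2 ** i for i in range(levels)]  # traditional CNN: 32, 64, ...
constant = [base] * levels                         # PocketNet-style: 32 everywhere

full = encoder_params(doubling)
pocket = encoder_params(constant)
print(f"doubling widths: {full:,} parameters")
print(f"constant widths: {pocket:,} parameters")
print(f"reduction factor: {full / pocket:.1f}x")
```

Because the cost of a convolution grows with the product of its input and output feature counts, almost all of the parameters in the doubling variant sit in the lowest-resolution levels, which is where holding the width constant saves the most.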

Preprocessing protocols
We apply the same preprocessing steps for all models and datasets. Namely, we apply the rule-based analysis and preprocessing steps proposed by the nnUNet architecture authors. The resulting target spacing and patch size for each individual and combined dataset are given in Table 3. Because of the increased computational cost of the Swin UNETR architecture vs its CNN counterparts, we use a patch size of 128 × 128 × 64 for this architecture.

Hyperparameters, training, and evaluation protocols
We train each model using at least two Nvidia A100 GPUs with a batch size of twice the number of GPUs. All models are trained for at least 1000 epochs and use the same optimization parameters as the nnUNet framework.
Apart from the Swin UNETR model, we use deep supervision. Additionally, we use automatic mixed precision during training to reduce the time and memory requirements. All models use the Dice with cross-entropy loss. We use test-time augmentation (average prediction after flipping along each axis) and postprocess the final predictions by taking the largest connected component. To evaluate the validity of each predicted segmentation mask, we use the following metrics: the DSC, the 95th percentile Hausdorff distance (HD 95), and the surface DSC with a tolerance of 2 mm. We chose surface DSC specifically to offset the skew that the large internal volume of the liver can have on the DSC 32.
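The DSC and the largest-connected-component postprocessing step can be illustrated with a short, self-contained sketch. Here a binary mask is represented as a set of (z, y, x) voxel coordinates, and components are found with 6-connectivity; both choices, along with the function names and the toy data, are assumptions made for illustration rather than details of the pipeline used in this work.

```python
from collections import deque


def dice(pred, truth):
    """Dice-Sorensen coefficient between two sets of foreground voxels."""
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def largest_component(mask):
    """Keep only the largest 6-connected component of a voxel set."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    unvisited, best = set(mask), set()
    while unvisited:
        seed = unvisited.pop()
        component, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            z, y, x = queue.popleft()
            for dz, dy, dx in steps:
                neighbor = (z + dz, y + dy, x + dx)
                if neighbor in unvisited:
                    unvisited.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        if len(component) > len(best):
            best = component
    return best


# Toy example: a 2 x 2 x 2 "liver" plus one spurious predicted voxel far away.
truth = {(z, y, x) for z in range(2) for y in range(2) for x in range(2)}
pred = truth | {(10, 10, 10)}

raw_score = dice(pred, truth)
clean_score = dice(largest_component(pred), truth)
print(raw_score, clean_score)
```

In this toy case the isolated false-positive voxel lowers the DSC only slightly, which also hints at why a volume-based metric alone can hide boundary errors on a large organ and why a surface-based metric is reported alongside it.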

Experimental design
Using the data and models described in the prior sections, we perform two experiments to evaluate each model's performance on MRI liver segmentation when trained on a single dataset and on ensembles of datasets.

Experiment 1: single source five-fold cross-validation
In this experiment, we perform a five-fold cross-validation with each model on each dataset separately. For each dataset, we set aside the first 20% of the data as an independent test set, take 10% of the remaining data as a validation set, and train on the remaining image-label pairs. We continue this process until we have test-time predictions for each image in a given dataset. This experiment aims to determine how each model performs on test images that come from the same distribution as the training data, which will serve as a baseline to compare how the same architectures perform on out-of-distribution examples in the following experiment.
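The splitting scheme described above can be sketched as follows. This is an illustrative reading of the protocol, not the code used in this work: the function name, the use of contiguous (unshuffled) folds, taking the validation images from the front of the remaining data, and the image identifiers are all assumptions.

```python
def five_fold_splits(items, n_folds=5, val_fraction=0.1):
    """Yield (train, val, test) lists; each item is tested in exactly one fold."""
    items = list(items)
    fold_size = len(items) // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs any remainder when the size is not divisible.
        stop = len(items) if k == n_folds - 1 else start + fold_size
        test = items[start:stop]
        rest = items[:start] + items[stop:]
        n_val = max(1, int(val_fraction * len(rest)))  # 10% of the remaining data
        val, train = rest[:n_val], rest[n_val:]
        yield train, val, test


images = [f"case_{i:03d}" for i in range(40)]  # hypothetical image IDs
splits = list(five_fold_splits(images))
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

After all five folds, every image has appeared in a test set exactly once, so the reported metrics cover the whole dataset.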

Experiment 2: leave-one-dataset-out cross-validation
Following our first experiment, we trained and validated six models on the curated T1-weighted MR images, each with a different dataset withheld for testing.
While our first experiment would demonstrate how each model would perform when tested on in-distribution samples, our second experiment aims to evaluate our models' performance when tested on out-of-distribution examples. Our hypothesis with this second experiment is that the test-time performance on the withheld dataset would match or exceed the corresponding results from Experiment 1 only if the images in the training set are of similar quality or contrast protocol type to those of the withheld dataset.
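A minimal sketch of the leave-one-dataset-out scheme is given below. The dataset labels follow the abbreviations used in the figures; the function name and the placeholder image identifiers (three per dataset) are hypothetical and do not reflect the real dataset sizes.

```python
DATASETS = ["AMOS", "ATLAS", "CHAOS", "DLDS", "MDA", "Methodist"]


def leave_one_dataset_out(images_by_dataset):
    """Yield (withheld_name, training_images, test_images) for each dataset."""
    for withheld in images_by_dataset:
        train = [
            img
            for name, imgs in images_by_dataset.items()
            if name != withheld
            for img in imgs
        ]
        yield withheld, train, list(images_by_dataset[withheld])


# Hypothetical placeholder data: three image IDs per dataset, not real counts.
images_by_dataset = {name: [f"{name}_{i}" for i in range(3)] for name in DATASETS}

folds = list(leave_one_dataset_out(images_by_dataset))
for withheld, train, test in folds:
    print(f"test on {withheld}: {len(train)} training images, {len(test)} test images")
```

Each fold therefore tests a model on an institution it has never seen, which is what makes the comparison with Experiment 1 a measure of out-of-distribution performance.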

Experiment 1: Single source five-fold cross-validation
Table 4 shows each metric's mean and standard deviation for each model resulting from a five-fold cross-validation on each dataset. We see that the PocketNet and nnUNet architectures generally achieve similar accuracy. However, both of these models outperform the Swin UNETR architecture.
For comparison, Fig. 2 provides boxplots of the DSC, HD 95, and surface DSC for Experiment 1. We see here that the nnUNet and PocketNet models show comparatively similar variations in accuracy, while the Swin UNETR shows the most variation. Outliers were caused primarily by under-segmentation of the liver, especially in the presence of motion or noise artifacts and large complex (solid/cystic) liver masses, under-segmentation of a tumor or lesion (relatively large lesion along the boundary of the right margin of the liver with signal hypointensity), and over-segmentation of either the abdominal wall or surrounding organs, such as the spleen and kidney. Figure 3 shows the resulting image segmentation quality for a subset of images with these characteristics.
In Fig. 3, all three models performed poorly on the same MR image from the ATLAS dataset, which showed severe over-segmentation of the spleen and other surrounding structures. This common failure is believed to be due to the close similarity of signal intensity between the liver and the spleen and the lack of a distinct boundary between the two organs in this MR image. In the DLDS column of Fig. 3, all three models under-segmented this case, although the Swin UNETR model contoured more of the liver than the other models. In this DLDS case, the imaging shows complex cystic solid masses. In the MDA column, all models under-segmented the right lobe of the liver on a portal venous phase MR image from a patient with a large homogeneous mass occupying
Figure 4 shows accurate predictions from each model. Notable errors were under-segmentation and over-segmentation of the inferior vena cava, although this discrepancy could be attributed to inter-observer variability across datasets.

Experiment 2: leave-one-dataset-out cross-validation
Table 5 shows each metric's mean and standard deviation for each model resulting from the leave-one-dataset-out cross-validation. As in Experiment 1, the PocketNet and nnUNet architectures generally achieve similar accuracy while outperforming the Swin UNETR model.
Recall that our hypothesis for Experiment 2 was that each model's performance, when tested on a withheld dataset, would match or exceed the corresponding results from Experiment 1 only if the images in the training set were of similar quality or contrast protocol type to those of the withheld dataset. In other words, because of the differences between each dataset, we would expect to see a decrease in accuracy for each model from Experiment 1 to Experiment 2. This generally appears to be the case for PocketNet and nnUNet, the exceptions being PocketNet, which recorded overall better accuracy on the CHAOS dataset, and nnUNet, which did so on the MD Anderson dataset. The Swin UNETR model does not appear to conform to our hypothesis. In this case, Swin UNETR reports improved mean DSC for the ATLAS, CHAOS, and MD Anderson datasets and improved HD 95 for all but the AMOS and MD Anderson datasets.
Figure 6 shows predicted segmentation masks whose DSC is lower than 0.8. We exclude AMOS and ATLAS since all three models achieved a DSC of at least 0.8 for nearly every example. When tested on a low-accuracy case from DLDS, the Swin UNETR model completely under-segmented the target organ, only labeling a tiny sliver of the right liver lobe. PocketNet and nnUNet over-segmented the abdominal region surrounding the front right liver lobe in the same image. We hypothesize that the models performed poorly on this DLDS case due to massive ascites (fluid around the liver) and shrunken cirrhotic liver. In the case of the MD Anderson column in Fig. 6, PocketNet and nnUNet only segmented the right liver lobe. Coincidentally, nnUNet's outlier was the same MR image that was its outlier when trained on this cohort in Experiment 1. Finally, all three models over-segmented the spleen when tested on their worst case from the Methodist dataset.

Error analysis
Low Dice scores (DSC < 0.8), indicated in Figs. 2 and 5, were manually reviewed across all models and experiments to characterize failure modes. In this analysis, we found that the most common failure modes are:
1. Motion artifacts (Fig. 3 [DLDS]).
2. Massive ascites (fluid around the liver) and shrunken cirrhotic liver (Fig. 6 [DLDS]).
3. Similar signal intensities between the liver and surrounding regions (Fig. 3 [ATLAS]).
4. The presence of a large infiltrative lesion (Figs. 3 and 6 [MDA]).
Table 6 shows the frequency of each failure mode within our dataset. Here, unique image series are considered, i.e., repeat poor performance across models and experiments was counted once. Note that there were seven cases where we did not see any odd pathologies and could not determine why our models produced less accurate liver segmentation masks.
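The selection of cases for this review can be sketched as follows: every prediction with a DSC below 0.8 is flagged, and each image series is counted once regardless of how many models or experiments performed poorly on it. The record layout, function name, series identifiers, and scores below are hypothetical and serve only to illustrate the counting rule.

```python
def low_dice_series(records, threshold=0.8):
    """Return the unique image series with a DSC below the threshold in any run."""
    return sorted({r["series"] for r in records if r["dsc"] < threshold})


# Hypothetical per-prediction results; none of these values come from the study.
records = [
    {"series": "DLDS_0001", "model": "nnUNet", "experiment": 1, "dsc": 0.62},
    {"series": "DLDS_0001", "model": "PocketNet", "experiment": 2, "dsc": 0.71},
    {"series": "MDA_0007", "model": "Swin UNETR", "experiment": 1, "dsc": 0.55},
    {"series": "CHAOS_0003", "model": "nnUNet", "experiment": 1, "dsc": 0.93},
]

flagged = low_dice_series(records)
print(flagged)  # each poorly segmented series appears once
```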

Discussion
We aimed to use three deep learning architectures and as many T1-weighted MR images as we could gather from multiple institutions to train a robust, accurate liver segmentation model for multiple MRI vendors and liver disease etiologies. From a supervised learning perspective, such a model must be trained on a sufficiently large and diverse cohort of MR images encompassing as many etiologies, contrast agent types, and artifacts as possible.
Our Experiment 2 models and their results, when tested on their respective withheld datasets, provided us with an approximation of how each of the three architectures might perform when confronted with a new dataset. Additionally, our results show that PocketNet and nnUNet are effective architectures for training accurate and robust models for MRI liver segmentation. These architectures achieved similar accuracy in Experiments 1 and 2 and showed similar variance. We hypothesized that the models in Experiment 2 would not outperform those in Experiment 1 because the withheld dataset is sufficiently different from the rest of the training data in each fold of Experiment 2. Our results for Experiment 2 using nnUNet and PocketNet support this hypothesis. We generally see a drop in accuracy for each model (PocketNet and nnUNet) except for the CHAOS dataset with PocketNet and the MD Anderson dataset with nnUNet. However, even in those non-conforming cases, the increase in performance is slight, with the only exception being the HD 95 with nnUNet.
Our hypothesis regarding the accuracy differences in Experiments 1 and 2 using the Swin UNETR model does not hold up as well as with the PocketNet and nnUNet models. One possible explanation is that transformer networks like Swin UNETR are data-hungry 33,34. The bigger training set size for each fold in Experiment 2 may have helped alleviate this commonly seen challenge with vision transformers. Indeed, the difference in accuracy in the small CHAOS dataset (n = 40) supports this. In Experiment 1, the Swin UNETR model could only use 32 images for training. On the other hand, this same model had 779 images to train with during Experiment 2.
We might consider the drop in performance observed across all three models when tested on the DLDS in Experiment 2 compared with how they performed when trained only on this dataset in Experiment 1 as supporting evidence for our hypothesis, given the large amounts of motion and susceptibility artifacts that are present in the dataset 19, more so than any other dataset that we used. These artifacts most probably contributed to the Swin UNETR model's drop in performance, as its surface DSC values were the lowest of all three models when tested on DLDS in Experiment 2, and these artifacts also likely worsened the performance of PocketNet and nnUNet. However, another reason for the worsened performance of the models could be the liver shape and appearance changes caused by cirrhosis, which would suggest that liver disease etiology was a more significant confounding factor than image quality or contrast type.
Our results present evidence for and against our Experiment 2 hypothesis, and the lack of information regarding contrast types for the AMOS dataset or echo and repetition times for both AMOS and CHAOS are further complications that prevent us from making a proper conclusion on this hypothesis.
Maximum segmentation accuracy is necessary for precise localization and characterization of the liver tissue and the accompanying pathology, which aids radiologists and surgeons in optimizing the diagnosis and staging of the disease, and this is considered the cornerstone of management and treatment planning in terms of surgery and radiological intervention. An automated, robust segmentation model will give a more reproducible estimate of the volumetric measures and extent of liver tissue/lesion than manual or semi-automated methods, as these could be biased or subject to interobserver variability [12][13][14]. Automated models are still in their developmental stages, and their underperformance regarding segmentation accuracy can result in suboptimal patient outcomes. For example, under-segmentation can lead to the persistence of residual tissue after resection or chemoembolization, whereas over-segmentation can result in unnecessary interventions and inaccurate estimation of the residual liver volume and function during surgical planning 8. Future work will further evaluate the impact of the observed failure modes in Table 6 on segmentation accuracy.

Table 6. Characterization of image features that result in low DSC for all three models and both experiments.

Finding                      No. images
Motion artifact              9
Ascites and cirrhosis        8
Similar signal intensity     5
Large infiltrative lesion    4
Hernia                       2
No odd pathology             7
Total                        35

Of the three architectures tested, our results indicate that PocketNet and nnUNet are effective architectures for training accurate and robust models for MRI liver segmentation. While the Swin UNETR model was not as accurate as its CNN counterparts in either experiment, improving its performance with pre-training on large, publicly available CT datasets like LiTS or TotalSegmentator may be possible. This line of inquiry is a possible direction for future work. The differences in the number of parameters in each architecture are also worth noting. PocketNet has roughly 800,000 parameters, nnUNet has roughly 31,000,000, and Swin UNETR has roughly 62,000,000 parameters. Our results show similar performance between PocketNet (a pocket version of nnUNet) and the full-sized nnUNet, further validating the results from the original PocketNet paper 31. While PocketNet and nnUNet show similar accuracy, it is also important to point out that the reduced computational cost of PocketNet (from having fewer parameters) makes training and deploying such a model more suitable for resource-constrained environments that might not have access to the latest GPUs (or GPUs in general).
Our work built upon existing research by training the proven nnUNet and its PocketNet variant on the task of segmenting the liver using 819 T1-weighted MR images gathered mostly from liver cancer patients with different contrast protocols, with performance ranging from comparable to superior when compared against existing models 15,16,20,21,23,24. However, unlike Lambert et al.'s AHUNets 21, we did not distinguish between the liver and the tumor and counted the latter as part of the former.
Of the six datasets we used in our experiments, only AMOS, ATLAS, CHAOS, and DLDS are publicly available. As a result, only the results from Experiment 1 with these specific datasets will be reproducible. Furthermore, although curating multiple datasets allowed us to build a sizable and diverse group of MR images for our work, these datasets were labeled by different people. This interobserver variability between ground truth masks is another important confounding factor. Unfortunately, unless one or more trained radiologists are willing to manually edit over 800 liver contours to ensure uniformity across datasets, this limitation has no easy fix.
Work by Isensee et al. that compared the rankings of models submitted to a kidney and kidney tumor segmentation challenge indicated that changes to external parameters such as the learning rate, patch sizes, loss functions, and preprocessing schemes had a more significant impact on performance than changes to actual network architecture 16. Future work might involve refinement of the "method configuration," as Isensee et al. collectively referred to these parameters, to determine their effect on liver segmentation accuracy. Additional avenues of exploration include further training of our models on any additional T1-weighted liver MRI datasets that have been made public since the start of our research (e.g., TotalSegmentator MRI 35), applying our methodology to T2-weighted MRI datasets, training on a combined T1- and T2-weighted dataset, or further cross-sequence fusion across additional imaging modalities. Future work will also involve image denoising. Cui et al. recently used a 2D CNN and k-space analysis to reduce and remove motion artifacts from corrupted T2-weighted brain MR images 36. Given both the prevalence of motion artifacts in DLDS and the fact that such artifacts are not uncommon in a clinical setting 19, an algorithm that can be applied to remove motion artifacts from liver MR images would expedite the training of robust deep learning segmentation models to assist in preventive surgery.

Conclusions
We sought to train a robust and generalizable liver T1-weighted MRI segmentation model across different contrast protocols and disease etiologies. Of the architectures we trained using an ensemble of curated data drawn from multiple datasets, we found that models trained using PocketNet and nnUNet were the most robust to changes in image and target organ appearance due to a difference in imaging or health factors. We observed this trend across all six datasets, suggesting that any PocketNet or nnUNet model trained on an ensemble of T1-weighted MR images of similar or greater size and diversity will also demonstrate this generalizability.

Fig. 2. Boxplots for Experiment 1. (A) DSC for the six datasets and three models, (B) surface DSC for the same, and (C) HD 95. We see here that the nnUNet and PocketNet models show comparatively similar variations in accuracy, while the Swin UNETR shows the most variation.

Fig. 3. Examples of poorly predicted segmentation masks from all three models in Experiment 1. In the case of the MDA image, we see a large solid lesion on the liver boundary whose signal intensity is close to its surroundings, resulting in under-segmentation. For the ATLAS case, we see close signal intensity between the liver and the spleen, resulting in over-segmentation. For the DLDS case, we see a motion artifact resulting in under-segmentation.

Fig. 5. Boxplots for Experiment 2. (A) DSC for the six datasets and three models, (B) surface DSC for the same, and (C) HD 95. We see here that the nnUNet and PocketNet models show comparatively similar variations in accuracy, while the Swin UNETR shows the most variation.

Fig. 6. Examples of poorly predicted segmentation masks from all three models in Experiment 2. As in Experiment 1, we see a large lesion on the liver boundary whose signal intensity is close to its surroundings, resulting in under-segmentation in the same MDA case. In the DLDS case, we see massive ascites (fluid around the liver) and shrunken cirrhotic liver, resulting in under-segmentation for the Swin UNETR model and over-segmentation for the PocketNet and nnUNet models. For the Methodist case, the liver and spleen have similar signal intensities, resulting in over-segmentation.

https://doi.org/10.1038/s41598-024-71674-y

Table 1. Summary of echo time (TE), repetition time (TR), and contrast agents used in MRIs.

number of features at each resolution constant. In contrast, traditional CNNs double the number of features when going from higher to lower resolutions. As a result, PocketNet architectures reduce the number of parameters in CNN architectures by several orders of magnitude and have been shown to achieve similar accuracy to traditional CNNs. Here, we apply the PocketNet paradigm to the nnUNet architecture and refer to this architecture as PocketNet for the sake of conciseness.

Table 4. The mean (standard deviation) for each model's DSC, HD 95, and surface DSC for Experiment 1, a five-fold cross-validation on each dataset. We highlight the best values across all metrics in bold. The PocketNet and nnUNet architectures are comparable and outperform the Swin UNETR model.

Table 5. The mean (standard deviation) for each model's DSC, HD 95, and surface DSC for Experiment 2, a leave-one-dataset-out cross-validation. We highlight the best values across all metrics in bold. The PocketNet and nnUNet architectures are comparable and outperform the Swin UNETR model.